UAVM: Towards Unifying Audio and Visual Models

Abstract

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of the UAVM that its modality-independent counterparts do not have.

Related resources

Unifying Low-Rank Models for Visual Learning

Many problems in signal processing, machine learning, and computer vision can be solved by learning low-rank models from data. In computer vision, problems such as rigid structure from motion have been formulated as an optimization over subspaces with fixed rank. These hard-rank constraints have traditionally been imposed by a factorization that parameterizes subspaces as a product of two matri...

Audio-Visual Event Recognition with Graphical Models

In this work, different applications for the automated detection of events have been investigated using audio-visual pattern recognition methods. The recorded data has been taken from both video surveillance and video conferences. Acoustic, visual, and semantic features are extracted from the available data and are subsequently analysed with the help of graphical models. These are particularl...

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduc...

Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low level, faces severe challenges, including the lack of accurate statistical models for the signals, their high dimensionality, and their varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tract...

Audio Visual Speech Recognition and Segmentation Based on DBN Models

Dongmei Jiang (1,2), Guoyun Lv (1), Ilse Ravyse (2), Xiaoyue Jiang (1), Yanning Zhang (1), Hichem Sahli (2,3) and Rongchun Zhao (1). Joint NPU-VUB Research Group on Audio Visual Signal Processing (AVSP). (1) Northwestern Polytechnical University (NPU), School of Computer Science, 127 Youyi Xilu, Xi’an 710072, P.R. China. (2) Vrije Universiteit Brussel (VUB), Electronics & Informatics Dept., VUB-ETRO, Pleinlaan 2, 1050 Brus...

Journal

Journal title: IEEE Signal Processing Letters

Year: 2022

ISSN: 1558-2361, 1070-9908

DOI: https://doi.org/10.1109/lsp.2022.3224688